Grammar-based Compression of DNA Sequences

نویسندگان

  • Neva Cherniavsky
  • Richard Ladner
چکیده

Grammar-based compression algorithms infer context-free grammars to represent the input data. The grammar is then transformed into a symbol stream and finally encoded in binary. We explore the utility of grammar-based compression of DNA sequences. We strive to optimize the three stages of grammar-based compression to work optimally for DNA. DNA is notoriously hard to compress, and ultimately, our algorithm fails to achieve better compression than the best competitor.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem

Motivated by the goal of discovering hierarchical structures inside DNA sequences, we address the Smallest Grammar Problem, the problem of finding a smallest context-free grammar that generates exactly one sequence. This NPHard problem has been widely studied for applications like Data Compression, Structure Discovery and Algorithmic Information Theory. From the theoretical point of view, our c...

متن کامل

Manning Inferring Sequential Structure Craig G . Nevill - Manning

Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit. Rather than examining the meaning of the individual symbols in the sequence, structure is detected in the way that certain combinations of symbols recur. In speech and text, these repetitions fo...

متن کامل

Inferring Sequential Structure

Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit. Rather than examining the meaning of the individual symbols in the sequence, structure is detected in the way that certain combinations of symbols recur. In speech and text, these repetitions fo...

متن کامل

Study On Universal Lossless Data Compression by using Context Dependence Multilevel Pattern Matching Grammar Transform

In this paper, the context dependence multilevel pattern matching(in short CDMPM) grammar transform is proposed; based on this grammar transform, the universal lossless data compression algorithm, CDMPM code is then developed. Moreover, it is proved that this algorithms’ worst case redundancy among all individual sequences of length n from a finite alphabet is upper bounded by ) log / 1 ( n C w...

متن کامل

Grammar Compressed Sequences with Rank/Select Support

Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004